Error analysis and confidence measure of Chinese word segmentation

Authors

  • Chih-Chung Kuo
  • Kun-Yuan Ma
Abstract

Word segmentation for a Chinese sentence is essential for many applications in language and speech processing. No existing method achieves word segmentation without errors. We propose a confidence measure for the segmentation result to cope with the problems these errors cause. The method's effectiveness depends mainly on the error analysis of the word segmentation. With the confidence measure, suspected errors can be identified so that the manual inspection load can be greatly reduced for non-real-time applications. A soft-decision method and a composite-word approach for prosody generation are also designed for text-to-speech systems by exploiting the confidence measure, so that incorrect prosody caused by wrong word boundaries can be alleviated.

Similar resources

A New Word Language Model Evaluation Metric for Character Based Languages

Perplexity is a widely used measure of the word prediction power of a word-based language model. It can be computed independently and has shown good correlation with word error rate (WER) in speech recognition. However, for character-based languages, character error rate (CER) is commonly used instead of WER as the measure for speech recognition, although the language model is still word based...

Statistical Models for Word Segmentation And Unknown Word Resolution

In a Chinese sentence, there are no word delimiters, like blanks, between the “words”. Therefore, it is important to identify the word boundaries before processing Chinese text. Traditional approaches tend to use dictionary lookup, morphological rules and heuristics to identify the word boundaries. Such approaches may not be applied to a large system due to the complicated linguistic phenomena ...

Report to BMM-based Chinese Word Segmentor with Context-based Unknown Word Identifier for the Second International Chinese Word Segmentation Bakeoff

This paper describes a Chinese word segmentor (CWS) based on the backward maximum matching (BMM) technique for the 2nd Chinese Word Segmentation Bakeoff in the Microsoft Research (MSR) closed testing track. Our CWS comprises a context-based Chinese unknown word identifier (UWI). All the context-based knowledge for the UWI is fully automatically generated by the MSR training corpus. According to...
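For readers unfamiliar with backward maximum matching, the following is a minimal sketch of the generic technique only, not the segmentor described in this entry. The function name, the `max_word_len` parameter, and the toy lexicon are all assumptions made for illustration.

```python
def backward_maximum_matching(text, lexicon, max_word_len=4):
    """Segment text right to left, taking the longest lexicon match at each step.

    Generic illustration of BMM; names and defaults here are hypothetical.
    """
    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shrink it.
        for size in range(min(max_word_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                end -= size
                break
    # Words were collected from the end of the sentence, so restore order.
    words.reverse()
    return words


# Toy lexicon (assumed for illustration, not taken from any corpus).
toy_lexicon = {"研究", "研究生", "生命", "起源"}
print(backward_maximum_matching("研究生命起源", toy_lexicon))
# → ['研究', '生命', '起源']
```

Scanning from the right lets the matcher pick 生命 before it ever considers 研究生, which is the kind of boundary ambiguity a left-to-right longest match would resolve differently.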

A Probe into Ambiguities of Determinative-Measure Compounds

This paper aims to further probe into the problems of ambiguities for automatic identification of determinative-measure compounds (DMs) in Chinese and to develop sets of rules to identify DMs and their parts of speech. It is known that Chinese DMs are identifiable by regular expressions. DM rule matching helps one solve word segmentation ambiguities, and parts of speech help one improve sense r...

Subword-based Tagging by Conditional Random Fields for Chinese Word Segmentation

We proposed two approaches to improve Chinese word segmentation: a subword-based tagging and a confidence measure approach. We found the former achieved better performance than the existing character-based tagging, and the latter improved segmentation further by combining the former with a dictionary-based segmentation. In addition, the latter can be used to balance out-of-vocabulary rates and ...

Publication date: 1998